Lemmatization and Morphosyntactic Tagging of Croatian and Serbian
Abstract
We investigate state-of-the-art statistical models for lemmatization and morphosyntactic tagging of Croatian and Serbian. The models are built on SETIMES.HR, a new manually annotated corpus of Croatian based on the SETimes parallel corpus. We train models on Croatian text and evaluate them on samples of Croatian and Serbian from the SETimes corpus and the two Wikipedias. Lemmatization accuracy for the two languages reaches 97.87% and 96.30%, while full morphosyntactic tagging accuracy using a 600-tag tagset peaks at 87.72% and 85.56%, respectively. Part-of-speech tagging accuracies reach 97.13% and 96.46%. Results indicate that more complex methods of Croatian-to-Serbian annotation projection are not required at these dataset sizes for these particular tasks. The SETIMES.HR corpus, its resulting models and test sets are all made freely available.
Similar resources
The SETimes.HR Linguistically Annotated Corpus of Croatian
We present SETIMES.HR, the first linguistically annotated corpus of Croatian that is freely available for all purposes. The corpus is built on top of the SETIMES parallel corpus of nine Southeast European languages and English. It is manually annotated for lemmas, morphosyntactic tags, named entities and dependency syntax. We couple the corpus with domain-sensitive test sets for Croatian and Se...
Parsing Croatian and Serbian by Using Croatian Dependency Treebanks
We investigate statistical dependency parsing of two closely related languages, Croatian and Serbian. As these two morphologically complex languages of relaxed word order are generally under-resourced – with the topic of dependency parsing still largely unaddressed, especially for Serbian – we make use of the two available dependency treebanks of Croatian to produce state-of-the-art parsing mod...
New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian
In this paper we present newly developed inflectional lexicons and manually annotated corpora of Croatian and Serbian. We introduce hrLex and srLex, two freely available inflectional lexicons of Croatian and Serbian, and describe the process of building these lexicons, supported by supervised machine learning techniques for lemma and paradigm prediction. Furthermore, we introduce hr500k, a manual...
Machine Learning of Morphosyntactic Structure: Lemmatizing Unknown Slovene Words
Automatic lemmatization is a core application for many language processing tasks. In inflectionally rich languages, such as Slovene, assigning the correct lemma (base form) to each word in a running text is not trivial, since, for instance, nouns inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, sin...
Croatian Lemmatization Server
The need for lemmatization in inflectionally rich languages is indisputable: it is applicable to a whole range of procedures, from text search up to parsing. Of the two predominant approaches to lemmatization, 1) algorithmic (generally rule-based and realized with FSA) and 2) relational (generally data-driven and realized with databases), this paper opts for the latter. The reason is that fo...
Publication date: 2013